This document explains the methodology of the correlations analysis for San Jose social distancing compliance and summarizes key results. It uses data on social distancing updated on 5/4/2020.

library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(tigris)
library(censusapi)
library(leaflet)
library(lehdr)
library(usmap)


options(
  tigris_class = "sf",
  tigris_use_cache = TRUE
)

Methodology

The data used for social distancing compliance come from SafeGraph’s social distancing dataset. In particular, we used the data on devices “completely at home,” which SafeGraph defines as devices that did not leave their usual nighttime location (see documentation here https://docs.safegraph.com/docs/social-distancing-metrics). For each census block group in San Jose, we calculated the average percent of devices completely at home on weekdays since the start of the Bay Area shelter-in-place order (3/16/2020), as well as the average percent of devices completely at home on weekdays during the months of January and February 2020, prior to the shelter-in-place order and widespread COVID-19 concerns. From these results, we obtain the percent of devices leaving home during each time period as 100 minus the percent completely at home.
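The aggregation described above can be sketched in base R as follows. This is a minimal illustration on made-up daily records, not the code used for the analysis: the column names (`block_group`, `device_count`, `completely_home`) are hypothetical, and averaging the daily percentages (rather than pooling device counts) is an assumption.

```r
# Toy daily records for two hypothetical block groups; all values are made up.
# Column names are illustrative, not the actual SafeGraph schema.
daily <- data.frame(
  block_group     = rep(c("A", "B"), each = 4),
  date            = as.Date(rep(c("2020-02-03", "2020-02-08",
                                  "2020-03-17", "2020-03-21"), times = 2)),
  device_count    = rep(c(100, 200), each = 4),
  completely_home = c(20, 35, 50, 60, 30, 70, 80, 120)
)

# Keep Monday through Friday only (POSIXlt wday: 0 = Sunday, 6 = Saturday).
wday <- as.POSIXlt(daily$date)$wday
daily <- daily[wday >= 1 & wday <= 5, ]

# Label each day as before or after the start of the order. The toy data
# contain no dates between 3/1 and 3/15, which the analysis does not use.
daily$period <- ifelse(daily$date >= as.Date("2020-03-16"), "post", "pre")

# Daily percent of devices completely at home, then the weekday average
# for each block group and period.
daily$pct_home <- 100 * daily$completely_home / daily$device_count
avg_home <- aggregate(pct_home ~ block_group + period, data = daily, FUN = mean)

# Percent of devices leaving home is the complement.
avg_home$pct_leaving <- 100 - avg_home$pct_home
avg_home
```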

In this correlations analysis, we assess social distancing compliance in two ways: directly, using the percent of devices leaving home (raw movement metric), and using the increase in the percent of devices staying completely at home after the shelter-in-place order relative to before the order (stay-at-home increase metric). Though the raw movement metric provides insight into overall social distancing behavior, the stay-at-home increase metric indicates the ability of a community to alter its behavior to comply with the shelter-in-place order. We examine correlations between both of these metrics of social distancing and various demographic variables, including income, age distribution, language ability, race, ethnicity, education level, vehicle ownership, occupants per room in a household, percent of workers who are male, and high speed internet access.
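To make the two metrics concrete, here is a small sketch on made-up numbers. The column names are hypothetical, and treating the stay-at-home increase as a simple difference in percentage points is an assumption (it is consistent with the "Dif in % completely at home" axis label used later in this document).

```r
# Hypothetical weekday averages of percent of devices completely at home.
bg <- data.frame(
  block_group   = c("A", "B"),
  pct_home_pre  = c(30, 15),
  pct_home_post = c(50, 40)
)

# Raw movement metric: percent of devices leaving home since the order.
bg$pct_leaving_post <- 100 - bg$pct_home_post

# Stay-at-home increase metric (assumed here to be a difference in
# percentage points between the post and pre periods).
bg$home_increase <- bg$pct_home_post - bg$pct_home_pre
bg
```

In this toy case block group A has the lower raw movement, while block group B shows the larger change in behavior, which is why the two metrics are examined separately.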

Key results

Here we display the most significant results that we obtained, focusing on variables with the relatively strongest correlations with social distancing behavior. We present plots showing the correlations between social distancing compliance and individual demographic variables for a few key variables, and provide the multiple regression analyses that we found to offer the most insight into the data.

Raw movement metric

Individual variables: Income

We considered a variety of income ranges, and found that the percent of households with incomes over $125,000 was the most revealing breakdown.

# load data
sj_dem_distancing_pre_post <- readRDS("/Users/simonespeizer/Documents/2020 Spring Quarter/CEE 218Z/covid19/Simone_Speizer/sj_socialdistancing_demdata_prepostdifs_manyvars.rds")

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% over 125,000`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) +
  labs(
    x = "Percent of households with incomes over $125,000 annually",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Households Above $125,000"
  )

income_125_model <- lm(`% not completely at home` ~ `% over 125,000`, sj_dem_distancing_pre_post)
summary(income_125_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `% over 125,000`, data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.3864  -4.6201  -0.6336   4.0545  31.0130 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.43673    0.72846   82.97   <2e-16 ***
## `% over 125,000` -0.21440    0.01609  -13.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.308 on 567 degrees of freedom
## Multiple R-squared:  0.2384, Adjusted R-squared:  0.2371 
## F-statistic: 177.5 on 1 and 567 DF,  p-value: < 2.2e-16

The model output shown above summarizes the fit of a linear model between the percent of households with incomes over $125,000 and the percent of devices leaving home on weekdays since shelter-in-place. The column labeled “Estimate” gives the slope of this model, that is, the change in the percent of devices leaving home associated with a 1 percentage point increase in the percent of households with incomes over $125,000. The R-squared value measures the share of the variation in the dependent variable (devices leaving home) that is explained by the independent variable (income). In this case, a 1 percentage point increase in the percent of households with incomes over $125,000 is associated with a decrease of about 0.2 percentage points in the percent of devices leaving home, and the model explains about 24% of the variability in devices leaving home. We can also compare this to the behavior prior to the shelter-in-place order.
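As an aside on reading such output, the slope, its p-value, and the R-squared can also be pulled out of the fitted model programmatically. The sketch below uses simulated data with hypothetical column names (`pct_over_125k`, `pct_leaving`); the simulated slope and noise level are arbitrary choices, not estimates from the San Jose data.

```r
set.seed(1)
n <- 500

# Simulated block groups: a true slope of -0.2 plus random noise.
sim <- data.frame(pct_over_125k = runif(n, 0, 100))
sim$pct_leaving <- 60 - 0.2 * sim$pct_over_125k + rnorm(n, sd = 7)

fit <- lm(pct_leaving ~ pct_over_125k, data = sim)
fit_summary <- summary(fit)

# "Estimate" column: change in pct_leaving per 1 point of pct_over_125k.
slope <- coef(fit)[["pct_over_125k"]]

# p-value for the slope, from the coefficient table.
slope_p <- fit_summary$coefficients["pct_over_125k", "Pr(>|t|)"]

# Share of the variation in pct_leaving explained by the model.
r_squared <- fit_summary$r.squared

c(slope = slope, p_value = slope_p, r_squared = r_squared)
```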

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% over 125,000`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) +
  labs(
    x = "Percent of households with incomes over $125,000 annually",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Social Distancing and Households Above $125,000 Pre Shelter-in-Place"
  )

income_125_model2 <- lm(`% not completely at home pre shelter` ~ `% over 125,000`, sj_dem_distancing_pre_post)
summary(income_125_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `% over 125,000`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.3546  -2.5692   0.0054   2.5391  16.4936 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      73.706412   0.423488  174.05   <2e-16 ***
## `% over 125,000`  0.095281   0.009356   10.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.249 on 567 degrees of freedom
## Multiple R-squared:  0.1546, Adjusted R-squared:  0.1531 
## F-statistic: 103.7 on 1 and 567 DF,  p-value: < 2.2e-16

The trend observed in the post-shelter-in-place data is the reverse of the one in the pre-shelter-in-place data: before shelter-in-place, higher income was weakly associated with more devices leaving home (slope of about 0.1), while after shelter-in-place, higher income is associated with fewer devices leaving home (slope of about -0.2).
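A pre/post comparison of this kind can be tabulated by fitting the same predictor against both outcomes. The following is a sketch on simulated data; the column names and the helper `fit_one` are hypothetical, and the simulated slopes are chosen only to mimic a sign reversal.

```r
set.seed(2)
n <- 400

# Simulated block groups with opposite-signed relationships pre and post.
sim <- data.frame(pct_over_125k = runif(n, 0, 100))
sim$leaving_pre  <- 74 + 0.1 * sim$pct_over_125k + rnorm(n, sd = 4)
sim$leaving_post <- 60 - 0.2 * sim$pct_over_125k + rnorm(n, sd = 7)

# Hypothetical helper: fit one outcome on income, return slope and R-squared.
fit_one <- function(outcome) {
  fit <- lm(reformulate("pct_over_125k", response = outcome), data = sim)
  c(slope = coef(fit)[["pct_over_125k"]], r_squared = summary(fit)$r.squared)
}

# One row per outcome, one column per statistic.
comparison <- t(sapply(c("leaving_pre", "leaving_post"), fit_one))
comparison
```

A sign change in the `slope` column between the two rows corresponds to the kind of reversal described above.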

Individual variables: Education

We examine the percent of individuals in a block group that have a degree at the Associate’s level or higher and its correlation with the percent of devices leaving the home in that block group. We again start with the period after the shelter-in-place order.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent associates or higher`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of people with a degree at Associate's level or higher",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Education"
  )

educ_model <- lm(`% not completely at home` ~ `percent associates or higher`, sj_dem_distancing_pre_post)
summary(educ_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `percent associates or higher`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.948  -4.702  -0.839   3.546  42.795 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    60.66813    0.82988   73.11   <2e-16 ***
## `percent associates or higher` -0.19163    0.01628  -11.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.507 on 567 degrees of freedom
## Multiple R-squared:  0.1963, Adjusted R-squared:  0.1949 
## F-statistic: 138.5 on 1 and 567 DF,  p-value: < 2.2e-16

We see that education has a very similar correlation with social distancing to that of income: the percent of individuals with an Associate’s degree or higher explains about 20% of the variability in the percent of devices leaving home, with a 1 percentage point increase in the percent of individuals with such a degree associated with a decrease of about 0.2 percentage points in the percent of devices leaving home. We now compare this to behavior prior to the shelter-in-place order.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent associates or higher`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of people with a degree at Associate's level or higher",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Social Distancing and Education Pre Shelter-in-Place"
  )

educ_model2 <- lm(`% not completely at home pre shelter` ~ `percent associates or higher`, sj_dem_distancing_pre_post)
summary(educ_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `percent associates or higher`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.2224  -2.5458   0.0672   2.7859  13.9180 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    73.547853   0.476189  154.45   <2e-16 ***
## `percent associates or higher`  0.086343   0.009344    9.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.308 on 567 degrees of freedom
## Multiple R-squared:  0.1309, Adjusted R-squared:  0.1293 
## F-statistic: 85.39 on 1 and 567 DF,  p-value: < 2.2e-16

Again, the trend after the shelter-in-place order was mandated is the reverse of the trend before the order.

Individual variables: Asian population

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% Asian`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents that are Asian",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Asian Population"
  )

asian_model <- lm(`% not completely at home` ~ `% Asian`, sj_dem_distancing_pre_post)
summary(asian_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `% Asian`, data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.470  -5.111  -0.548   4.414  35.818 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.52803    0.58366   96.85   <2e-16 ***
## `% Asian`   -0.15086    0.01497  -10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.712 on 567 degrees of freedom
## Multiple R-squared:  0.1519, Adjusted R-squared:  0.1504 
## F-statistic: 101.5 on 1 and 567 DF,  p-value: < 2.2e-16

After the start of the shelter-in-place order, the percent of residents in a block group that are Asian predicts about 15% of the variation in percent of devices leaving home in that block group, with a negative correlation between these two variables, so higher percent Asian is associated with fewer devices leaving the home.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% Asian`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents that are Asian",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Social Distancing and Asian Population Pre Shelter-in-Place"
  )

asian_model2 <- lm(`% not completely at home pre shelter` ~ `% Asian`, sj_dem_distancing_pre_post)
summary(asian_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `% Asian`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.6447  -3.0068   0.0283   3.2005  12.1305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 78.069520   0.348967  223.72   <2e-16 ***
## `% Asian`   -0.013871   0.008952   -1.55    0.122    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.611 on 567 degrees of freedom
## Multiple R-squared:  0.004217,   Adjusted R-squared:  0.00246 
## F-statistic: 2.401 on 1 and 567 DF,  p-value: 0.1218

In contrast, prior to the shelter-in-place order, the percent of residents that are Asian had no statistically significant relationship with the percent of devices leaving home (p = 0.12, R-squared below 1%).

Individual variables: Hispanic/Latino population

We now consider the percent of residents that identify as Hispanic or Latino.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% non hispanic/latino`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents that are not Hispanic or Latino",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Hispanic/Latino Population"
  )

hisp_model <- lm(`% not completely at home` ~ `% non hispanic/latino`, sj_dem_distancing_pre_post)
summary(hisp_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `% non hispanic/latino`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.904  -4.623  -0.719   3.867  37.779 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             62.51099    1.00802   62.01   <2e-16 ***
## `% non hispanic/latino` -0.15990    0.01407  -11.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.557 on 567 degrees of freedom
## Multiple R-squared:  0.1856, Adjusted R-squared:  0.1842 
## F-statistic: 129.2 on 1 and 567 DF,  p-value: < 2.2e-16

After the start of the shelter-in-place order, the percent of residents in a block group that are not Hispanic or Latino explains about 19% of the variation in the percent of devices leaving home in that block group, with a higher percent not Hispanic or Latino associated with fewer devices leaving the home.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `% non hispanic/latino`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents that are not Hispanic or Latino",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Social Distancing and Hispanic/Latino Population Pre Shelter-in-Place"
  )

hisp_model2 <- lm(`% not completely at home pre shelter` ~ `% non hispanic/latino`, sj_dem_distancing_pre_post)
summary(hisp_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `% non hispanic/latino`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.6826  -2.6760  -0.0257   3.0167  16.7253 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             73.005160   0.581580 125.529  < 2e-16 ***
## `% non hispanic/latino`  0.067819   0.008115   8.357 4.97e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.36 on 567 degrees of freedom
## Multiple R-squared:  0.1097, Adjusted R-squared:  0.1081 
## F-statistic: 69.85 on 1 and 567 DF,  p-value: 4.967e-16

Again, the relationship is inverted when we look at behavior prior to the shelter-in-place order.

Individual variables: High speed internet access

Here we consider the percent of households that have access to high speed internet. This analysis was inspired by the paper on Social Distancing, Internet Access and Inequality by Chiou and Tucker (https://www.nber.org/papers/w26982) that found that the combination of high speed internet access and high income was the key driver of ability to stay at home.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent high speed`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of households with broadband such as cable, fiber optic or DSL",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and High Speed Internet"
  )

internet_model <- lm(`% not completely at home` ~ `percent high speed`, sj_dem_distancing_pre_post)
summary(internet_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `percent high speed`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.074  -4.591  -0.510   3.891  38.005 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          73.71276    2.11651   34.83   <2e-16 ***
## `percent high speed` -0.27418    0.02598  -10.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.656 on 567 degrees of freedom
## Multiple R-squared:  0.1642, Adjusted R-squared:  0.1627 
## F-statistic: 111.4 on 1 and 567 DF,  p-value: < 2.2e-16

High speed internet access explains about 16% of the variation in the percent of devices leaving home after shelter-in-place, with greater percentage of households with high speed internet access associated with fewer devices leaving home. We again compare this to the behavior prior to shelter-in-place.

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent high speed`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of households with broadband such as cable, fiber optic or DSL",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Social Distancing and High Speed Internet Pre Shelter-in-Place"
  )

internet_model2 <- lm(`% not completely at home pre shelter` ~ `percent high speed`, sj_dem_distancing_pre_post)
summary(internet_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `percent high speed`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.5136  -2.8718  -0.1496   2.8726  16.4347 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          69.96222    1.23534  56.634  < 2e-16 ***
## `percent high speed`  0.09508    0.01516   6.271 7.12e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 567 degrees of freedom
## Multiple R-squared:  0.06486,    Adjusted R-squared:  0.06321 
## F-statistic: 39.32 on 1 and 567 DF,  p-value: 7.121e-10

Before shelter-in-place, the relationship was much weaker (R-squared of about 6%) and reversed relative to after shelter-in-place: greater high speed internet access was associated with slightly more devices leaving home.
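The paper cited at the start of this section emphasizes the combination of high speed internet access and high income. One common way to probe such a combined effect is an interaction term in the linear model. The sketch below is not part of the analysis in this document: it uses simulated data, hypothetical column names, and an arbitrary simulated interaction, and is only meant to show the model syntax.

```r
set.seed(3)
n <- 500

# Simulated block groups; the negative interaction is built in on purpose.
sim <- data.frame(
  pct_over_125k  = runif(n, 0, 100),
  pct_high_speed = runif(n, 50, 100)
)
sim$pct_leaving <- 65 - 0.05 * sim$pct_over_125k - 0.05 * sim$pct_high_speed -
  0.003 * sim$pct_over_125k * sim$pct_high_speed + rnorm(n, sd = 3)

# `a * b` in a formula expands to a + b + a:b (main effects plus interaction).
fit_interaction <- lm(pct_leaving ~ pct_over_125k * pct_high_speed, data = sim)

# Coefficient on the interaction term.
interaction_coef <- coef(fit_interaction)[["pct_over_125k:pct_high_speed"]]
summary(fit_interaction)$coefficients
```

A negative interaction coefficient would indicate that the association between income and fewer devices leaving home is stronger where high speed internet access is more common.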

Multiple regression analyses

Some of the variables considered individually above may be related to one another, and thus the trends shown above may actually be driven by underlying correlations with other demographic variables. To assess this, we now consider performing multiple regression analyses on these data; this involves assessing the ability of a combination of multiple demographic variables to predict the social distancing compliance data. Combining multiple variables into a model may either demonstrate that all the variables included have some explanatory power, or may indicate that once the effect of one or more of the variables is accounted for, some of the other variables lose their predictive ability.
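The second possibility, a variable losing its predictive ability once a related variable is accounted for, can be illustrated with a small simulation. Everything here is hypothetical: `x1` drives the outcome, while `x2` is merely correlated with `x1` and has no effect of its own.

```r
set.seed(4)
n <- 500

# x1 affects y directly; x2 is correlated with x1 but has no direct effect.
x1 <- runif(n, 0, 100)
x2 <- 0.8 * x1 + rnorm(n, sd = 10)
y  <- 60 - 0.2 * x1 + rnorm(n, sd = 7)
sim <- data.frame(y, x1, x2)

# On its own, x2 appears to predict y because it stands in for x1.
fit_x2_alone <- lm(y ~ x2, data = sim)

# With x1 in the model, the coefficient on x2 reflects only what x2 adds.
fit_joint <- lm(y ~ x1 + x2, data = sim)

coef_alone <- coef(fit_x2_alone)[["x2"]]
coef_joint <- coef(fit_joint)[["x2"]]
p_alone <- summary(fit_x2_alone)$coefficients["x2", "Pr(>|t|)"]
p_joint <- summary(fit_joint)$coefficients["x2", "Pr(>|t|)"]

c(coef_alone = coef_alone, coef_joint = coef_joint,
  p_alone = p_alone, p_joint = p_joint)
```

Comparing the coefficient and p-value for `x2` across the two fits shows how much of its apparent effect is carried by `x1`.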

Of the individual demographic variable correlations presented above, income was the strongest predictor; thus, we focus on combining other variables with income to obtain a better fit.

Income and Hispanic/Latino population

We first perform a multiple regression analysis using income and percent of residents that are not Hispanic or Latino as our independent variables.

income_hisplat_model <- lm(sj_dem_distancing_pre_post$`% not completely at home` ~ sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`% non hispanic/latino`)
summary(income_hisplat_model)
## 
## Call:
## lm(formula = sj_dem_distancing_pre_post$`% not completely at home` ~ 
##     sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`% non hispanic/latino`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.871  -4.679  -1.088   3.863  30.651 
## 
## Coefficients:
##                                                    Estimate Std. Error t value
## (Intercept)                                        63.17702    0.96438  65.511
## sj_dem_distancing_pre_post$`% over 125,000`        -0.15808    0.02066  -7.653
## sj_dem_distancing_pre_post$`% non hispanic/latino` -0.07427    0.01746  -4.254
##                                                    Pr(>|t|)    
## (Intercept)                                         < 2e-16 ***
## sj_dem_distancing_pre_post$`% over 125,000`        8.50e-14 ***
## sj_dem_distancing_pre_post$`% non hispanic/latino` 2.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.2 on 566 degrees of freedom
## Multiple R-squared:  0.262,  Adjusted R-squared:  0.2594 
## F-statistic: 100.5 on 2 and 566 DF,  p-value: < 2.2e-16

When combined, income and the percent of residents that are not Hispanic or Latino explain about 26% of the variation in the percent of devices leaving the home. Both variables continue to be significant predictors. Note, however, that this is an increase of only about 2 percentage points over the R-squared for income alone, so the model has gained little predictive power from the addition of the percent of residents that are not Hispanic or Latino.
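The gain from adding a variable can be quantified by comparing the nested models directly. The sketch below uses simulated data and hypothetical column names; `anova()` on two nested `lm` fits performs an F-test of whether the added term improves the fit, and adjusted R-squared penalizes the extra parameter.

```r
set.seed(5)
n <- 500

# Simulated block groups with two predictors of differing strength.
sim <- data.frame(
  pct_over_125k    = runif(n, 0, 100),
  pct_non_hispanic = runif(n, 20, 100)
)
sim$pct_leaving <- 63 - 0.16 * sim$pct_over_125k -
  0.07 * sim$pct_non_hispanic + rnorm(n, sd = 7)

fit_small <- lm(pct_leaving ~ pct_over_125k, data = sim)
fit_large <- lm(pct_leaving ~ pct_over_125k + pct_non_hispanic, data = sim)

# Change in R-squared and adjusted R-squared from adding the second variable.
r2_gain     <- summary(fit_large)$r.squared - summary(fit_small)$r.squared
adj_r2_gain <- summary(fit_large)$adj.r.squared - summary(fit_small)$adj.r.squared

# F-test comparing the nested models.
nested_test <- anova(fit_small, fit_large)
nested_test
```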

Income, Hispanic/Latino population, and Asian population

We now add in percent of residents that are Asian to the income and non-Hispanic/Latino analysis.

income_hisplat_asian_model <- lm(sj_dem_distancing_pre_post$`% not completely at home` ~ sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`% non hispanic/latino` + sj_dem_distancing_pre_post$`% Asian`)
summary(income_hisplat_asian_model)
## 
## Call:
## lm(formula = sj_dem_distancing_pre_post$`% not completely at home` ~ 
##     sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`% non hispanic/latino` + 
##         sj_dem_distancing_pre_post$`% Asian`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9178  -4.4681  -0.6588   3.9115  29.9193 
## 
## Coefficients:
##                                                    Estimate Std. Error t value
## (Intercept)                                        63.74326    0.92274  69.080
## sj_dem_distancing_pre_post$`% over 125,000`        -0.17586    0.01984  -8.865
## sj_dem_distancing_pre_post$`% non hispanic/latino` -0.01814    0.01823  -0.995
## sj_dem_distancing_pre_post$`% Asian`               -0.11263    0.01488  -7.571
##                                                    Pr(>|t|)    
## (Intercept)                                         < 2e-16 ***
## sj_dem_distancing_pre_post$`% over 125,000`         < 2e-16 ***
## sj_dem_distancing_pre_post$`% non hispanic/latino`     0.32    
## sj_dem_distancing_pre_post$`% Asian`               1.52e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.867 on 565 degrees of freedom
## Multiple R-squared:   0.33,  Adjusted R-squared:  0.3264 
## F-statistic: 92.75 on 3 and 565 DF,  p-value: < 2.2e-16

This model explains 33% of the variation in the social distancing compliance data. Income has a somewhat larger coefficient than Asian population, and non-Hispanic/Latino population is no longer significant (p = 0.32) once the percent of residents that are Asian is included. Based on this result, we continue to use Asian population but not non-Hispanic/Latino population in our further models.

Best overall predictor of the data: Income, education, Asian population, and child population

Proceeding similarly, we identified the variables that, when combined, explained the greatest percentage of the variation in the percent of devices leaving home across block groups. This model incorporates income (percent of households earning more than $125,000), education (percent of residents with a degree at the Associate’s level or higher), percent of residents that are Asian, and percent of residents that are children.

raw_movement_best <- lm(sj_dem_distancing_pre_post$`% not completely at home` ~ sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`percent associates or higher` + sj_dem_distancing_pre_post$`percent less than 18` + sj_dem_distancing_pre_post$`% Asian`)
summary(raw_movement_best)
## 
## Call:
## lm(formula = sj_dem_distancing_pre_post$`% not completely at home` ~ 
##     sj_dem_distancing_pre_post$`% over 125,000` + sj_dem_distancing_pre_post$`percent associates or higher` + 
##         sj_dem_distancing_pre_post$`percent less than 18` + sj_dem_distancing_pre_post$`% Asian`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.396  -4.224  -0.753   3.877  33.110 
## 
## Coefficients:
##                                                           Estimate Std. Error
## (Intercept)                                               69.48332    1.34831
## sj_dem_distancing_pre_post$`% over 125,000`               -0.12742    0.02121
## sj_dem_distancing_pre_post$`percent associates or higher` -0.08356    0.02139
## sj_dem_distancing_pre_post$`percent less than 18`         -0.21445    0.04314
## sj_dem_distancing_pre_post$`% Asian`                      -0.11891    0.01347
##                                                           t value Pr(>|t|)    
## (Intercept)                                                51.534  < 2e-16 ***
## sj_dem_distancing_pre_post$`% over 125,000`                -6.007 3.39e-09 ***
## sj_dem_distancing_pre_post$`percent associates or higher`  -3.906 0.000105 ***
## sj_dem_distancing_pre_post$`percent less than 18`          -4.971 8.86e-07 ***
## sj_dem_distancing_pre_post$`% Asian`                       -8.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.69 on 564 degrees of freedom
## Multiple R-squared:  0.3652, Adjusted R-squared:  0.3607 
## F-statistic: 81.11 on 4 and 564 DF,  p-value: < 2.2e-16

We see that these four variables explain about 37% of the variation in the percent of devices leaving the home. Note again that income alone explained about 24% of this variation, and the model with income, non-Hispanic/Latino population, and Asian population explained 33%, so education and child residents add only about 3 to 4 percentage points of explanatory power. All variables correlate in the same direction with the percent of devices leaving home; that is, a higher percent of households earning over $125,000, percent of residents with higher educational attainment, percent of residents that are children, or percent of residents that are Asian is associated with a lower percent of devices leaving the home.

Stay-at-home increase metric

To be finished.

Other results: variables without strong correlations

Variables that we found not to have strong correlations with social distancing, either on their own or when combined with other variables in multiple regression analyses, include occupants per room in a household, percent of the population that is elderly, percent of the population that is white, vehicle availability, and percent of the workforce that is male. We demonstrate the lack of strong correlation for percent of the population that is elderly as well as for vehicle availability, plotted below.

Elderly population

After and before shelter-in-place behavior (raw movement metric):

sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50) %>% # get rid of extreme outliers
  ggplot(aes(
  x = `percent elderly`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents 65 and older",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Elderly Population"
  )

elderly_model <- lm(`% not completely at home` ~ `percent elderly`, sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50))
summary(elderly_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `percent elderly`, 
##     data = sj_dem_distancing_pre_post %>% filter(`percent elderly` < 
##         50))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.854  -5.158  -0.420   4.239  35.422 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       53.85340    0.77548  69.446  < 2e-16 ***
## `percent elderly` -0.17447    0.05355  -3.258  0.00119 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.281 on 564 degrees of freedom
## Multiple R-squared:  0.01848,    Adjusted R-squared:  0.01674 
## F-statistic: 10.62 on 1 and 564 DF,  p-value: 0.001188

sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50) %>% # get rid of extreme outliers
  ggplot(aes(
  x = `percent elderly`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents 65 and older",
    y = "Percent of devices leaving home on weekdays pre-shelter-in-place",
    title = "San Jose: Staying at Home and Elderly Population Pre Shelter-in-Place"
  )

elderly_model2 <- lm(`% not completely at home pre shelter` ~ `percent elderly`, sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50))
summary(elderly_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `percent elderly`, 
##     data = sj_dem_distancing_pre_post %>% filter(`percent elderly` < 
##         50))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.236  -2.830  -0.158   3.145  14.296 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        75.9045     0.4257 178.295  < 2e-16 ***
## `percent elderly`   0.1329     0.0294   4.522 7.47e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.546 on 564 degrees of freedom
## Multiple R-squared:  0.03499,    Adjusted R-squared:  0.03328 
## F-statistic: 20.45 on 1 and 564 DF,  p-value: 7.466e-06

Change in behavior (stay at home increase metric):

sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50) %>% # get rid of extreme outliers
  ggplot(aes(
  x = `percent elderly`,
  y = `% increase in staying completely home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of residents 65 and older",
    y = "Dif in % completely at home after shelter-in-place relative to before",
    title = "San Jose: Social Distancing and Elderly Population"
  )

elderly_model_dif <- lm(`% increase in staying completely home` ~ `percent elderly`, sj_dem_distancing_pre_post %>% filter(`percent elderly` < 50))
summary(elderly_model_dif)
## 
## Call:
## lm(formula = `% increase in staying completely home` ~ `percent elderly`, 
##     data = sj_dem_distancing_pre_post %>% filter(`percent elderly` < 
##         50))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.148  -5.695  -0.102   5.905  31.360 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.05114    0.89393   24.67  < 2e-16 ***
## `percent elderly`  0.30740    0.06172    4.98 8.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.545 on 564 degrees of freedom
## Multiple R-squared:  0.04212,    Adjusted R-squared:  0.04043 
## F-statistic:  24.8 on 1 and 564 DF,  p-value: 8.453e-07

Although these relationships are statistically significant, they are weak on their own: the percent of residents who are elderly explains only about 4% of the variation in the data (R-squared of roughly 0.04).
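The gap between statistical significance and explanatory power can be illustrated with a minimal sketch. The data below are simulated and are not drawn from the San Jose dataset; the sample size, slope, and noise level are arbitrary assumptions chosen only to mimic a weak linear signal buried in large noise.

```r
# Simulated data only: none of these values come from the San Jose dataset.
set.seed(1)
n <- 2000                                # arbitrary sample size (assumption)
x <- runif(n, min = 0, max = 25)         # hypothetical "percent elderly"
y <- 22 + 0.3 * x + rnorm(n, sd = 9.5)   # weak linear trend plus large noise

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# p-value of the slope: is the trend distinguishable from zero?
p_value <- fit_summary$coefficients["x", "Pr(>|t|)"]

# R-squared: what share of the variation in y does x account for?
r_squared <- fit_summary$r.squared

cat("slope p-value:", format(p_value, digits = 3), "\n")
cat("R-squared:    ", format(r_squared, digits = 3), "\n")
```

With many observations, even a slight trend yields a very small p-value, while R-squared stays low because most of the spread in the response is unrelated to the predictor.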

Vehicle availability

After and before shelter-in-place behavior (raw movement metric):

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent with vehicles`,
  y = `% not completely at home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of households with vehicles available",
    y = "Percent of devices leaving home on weekdays since shelter-in-place",
    title = "San Jose: Social Distancing and Vehicle Availability"
  )

vehicles_model <- lm(`% not completely at home` ~ `percent with vehicles`, sj_dem_distancing_pre_post)
summary(vehicles_model)
## 
## Call:
## lm(formula = `% not completely at home` ~ `percent with vehicles`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.780  -5.045  -0.352   4.820  38.120 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              77.3251     4.3171   17.91  < 2e-16 ***
## `percent with vehicles`  -0.2705     0.0453   -5.97 4.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.123 on 567 degrees of freedom
## Multiple R-squared:  0.05914,    Adjusted R-squared:  0.05748 
## F-statistic: 35.64 on 1 and 567 DF,  p-value: 4.182e-09
# compare to pre shelter in place
sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent with vehicles`,
  y = `% not completely at home pre shelter`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of households with vehicles available",
    y = "Percent of devices leaving home on weekdays pre shelter-in-place",
    title = "San Jose: Social Distancing and Vehicle Availability Pre Shelter-in-Place"
  )

vehicles_model2 <- lm(`% not completely at home pre shelter` ~ `percent with vehicles`, sj_dem_distancing_pre_post)
summary(vehicles_model2)
## 
## Call:
## lm(formula = `% not completely at home pre shelter` ~ `percent with vehicles`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.2686  -3.0020  -0.1003   3.0894  12.3768 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             68.07279    2.42269  28.098  < 2e-16 ***
## `percent with vehicles`  0.10049    0.02542   3.953  8.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.558 on 567 degrees of freedom
## Multiple R-squared:  0.02682,    Adjusted R-squared:  0.0251 
## F-statistic: 15.62 on 1 and 567 DF,  p-value: 8.701e-05

Change in behavior (stay at home increase metric):

sj_dem_distancing_pre_post %>% 
  ggplot(aes(
  x = `percent with vehicles`,
  y = `% increase in staying completely home`
)) + geom_point() + 
  geom_smooth(method=lm) + 
  labs(
    x = "Percent of households with vehicles available",
    y = "Difference in % completely at home after shelter-in-place relative to before",
    title = "San Jose: Social Distancing and Vehicle Availability"
  )

vehicles_model_dif <- lm(`% increase in staying completely home` ~ `percent with vehicles`, sj_dem_distancing_pre_post)
summary(vehicles_model_dif)
## 
## Call:
## lm(formula = `% increase in staying completely home` ~ `percent with vehicles`, 
##     data = sj_dem_distancing_pre_post)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.781  -5.993  -0.092   5.493  30.166 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -9.2523     4.9742  -1.860   0.0634 .  
## `percent with vehicles`   0.3709     0.0522   7.107 3.59e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.359 on 567 degrees of freedom
## Multiple R-squared:  0.08179,    Adjusted R-squared:  0.08017 
## F-statistic: 50.51 on 1 and 567 DF,  p-value: 3.586e-12

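Similarly, vehicle availability explains little of the variation in the social distancing data. Although each slope is statistically significant, R-squared ranges from about 3% (raw movement before shelter-in-place) to about 8% (stay-at-home increase metric).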
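When several single-predictor models are compared, it can help to gather the slope, p-value, and R-squared of each into one table rather than reading separate summaries. The sketch below is one possible way to do this; `fit_stats` is a hypothetical helper that is not part of the original analysis, and the models are fit on R's built-in `mtcars` data purely so that the example runs on its own.

```r
# Hypothetical helper: key statistics for a single-predictor lm fit.
fit_stats <- function(model) {
  s <- summary(model)
  data.frame(
    slope     = unname(coef(model)[2]),                   # predictor coefficient
    p_value   = unname(s$coefficients[2, "Pr(>|t|)"]),    # p-value of that coefficient
    r_squared = s$r.squared                               # share of variance explained
  )
}

# Stand-in models on built-in data (assumption: used only for illustration).
models <- list(
  wt   = lm(mpg ~ wt, data = mtcars),
  qsec = lm(mpg ~ qsec, data = mtcars)
)

# One row per model, labelled by the list names.
comparison <- do.call(rbind, lapply(models, fit_stats))
comparison$model <- names(models)
print(comparison)
```

Applied to this section, the same helper could be called on `list(post = vehicles_model, pre = vehicles_model2, dif = vehicles_model_dif)` to place the three vehicle-availability fits side by side.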